Newspaper Text Analysis
Part 1: Convert Images to Text
The first task was to parse the images using OCR (I used tesseract) and the layout information provided along with the task.
Here is a sketch of the approach I followed:
- Make a sequence of category IDs corresponding to the annotations arranged in their reading order
- Mark the points where the last "article" type appears
- Group annotations together using these "end points" to form a list of lists of article contents
- Parse each type of content within an article, and iterate over all articles to generate the output in the specified format.
Note: Newspaper text (or any text from a source with a column layout) has a special structure: words that do not fit at the end of a line are broken with a "-" followed by a line break. So, I used a regular expression to clean this artifact out.
/-(\r?\n)+/
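The cleanup step above can be sketched as follows; `dehyphenate` is an illustrative name, and the assumption that every hyphen before a line break marks a split word can occasionally merge a genuine hyphenation:

```python
import re

def dehyphenate(text: str) -> str:
    """Rejoin words that were split across a line break with a trailing
    hyphen, e.g. "govern-\\nment" -> "government", then collapse the
    remaining whitespace to single spaces."""
    text = re.sub(r"-(\r?\n)+", "", text)   # remove hyphen + line break(s)
    text = re.sub(r"\s+", " ", text)        # collapse line breaks and spaces
    return text.strip()

print(dehyphenate("The govern-\nment announced\nnew   policies."))
# -> The government announced new policies.
```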
Part 2: Run BERT on the Text
How do you deal with the noise (e.g., spelling mistakes, punctuation errors) in the OCRed text? Is this noise a problem for BERT?
OCR errors can cause many problems in conventional text analysis; punctuation in particular is often parsed incorrectly.
Resilience of BERT to spelling errors
BERT has a built-in defence against out-of-vocabulary (OOV) words: its tokenizer has learned "sub-words". When the tokenizer fails to recognize a word (i.e., it is OOV), it splits that word into smaller pieces that it does have in its vocabulary. The BERT tokenizer uses WordPiece for this splitting. So, if there is a spelling error, BERT can handle it more gracefully than a fixed dictionary-based method.
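The splitting behaviour can be illustrated with a toy greedy longest-match tokenizer in the style of WordPiece. The tiny vocabulary below is hypothetical (the real BERT vocabulary has ~30,000 learned entries), but it shows how a misspelled word still decomposes into known pieces instead of becoming unknown:

```python
# Hypothetical toy vocabulary; "##" marks continuation pieces.
VOCAB = {"news", "new", "##paper", "##pap", "##per", "##s"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Greedy longest-match sub-word tokenization, WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:                 # continuation pieces carry "##"
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:                 # nothing matched at this position
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("newspaper"))    # -> ['news', '##paper']
print(wordpiece_tokenize("newspapper"))   # misspelled, still split into pieces
```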
Resilience of BERT to punctuation errors
BERT models generalize to punctuation better than older RNN-based models. Moreover, the authors of this paper show that irrelevant changes in punctuation are correctly ignored by recent transformer models such as BERT. This resilience is part of what makes BERT so useful: it eliminates the need for a tedious data-cleaning pipeline of punctuation removal, lemmatisation, etc. Lemmatisation can in fact harm the model because of how WordPiece handles out-of-vocabulary words!
Some noise does still affect BERT compared to noise-free data; this paper shows how significantly noise can impact tasks that use BERT. Overall, I let BERT take a stab at the raw data itself, apart from the change mentioned above: I clean up the hyphen before a line break when a word is left incomplete at the end of a line. I also convert multiple line breaks and spaces to a single space. Punctuation can be important to the dataset, so I don't remove it (even emojis are handled well by BERT!).
How do you handle embedding very long articles?
The pre-trained model we are using can take at most 512 tokens, so longer articles can't be read in full. We have several options:
- Cut the longer texts off and only use the first 512 tokens
- Take the first 256 and last 256 tokens
- Convert longer texts (>512 tokens) into mini-batches, feed them to the model, get the embeddings out, concatenate the matrices, and then pool over all tokens
- Use only the headlines
I preferred the first method over the second because of how news is generally presented to the reader: the heading and the first few lines usually lay out the facts and the subject (the topic!) of the article. Compare this to, say, movie reviews, which often carry a lot of relevant information towards the end (especially for sentiment analysis tasks). The third approach sounds good, but only a few of our articles breach the limit by much. Since we have to pool over tokens anyway, it is not clear to me why the average of the tokens after the first 512 would contain more topic-relevant information than the pooled embedding of the first 512. Here is an insightful discussion which also notes that the gains from this technique were only marginal.
If we assume that the conclusion of an article contains information similar to its beginning, there is no need to extract tokens from the residual part; if conclusions contain new information, there is some value in doing so. I chose the first approach, truncating to the first 512 tokens, for this exercise. In any case, building the mini-batch option is not difficult: it just involves creating the batches, running them through BERT, and then separating and pooling their matrices accordingly.
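The first three options above can be sketched over a plain list of token ids; `MAX_LEN` mirrors BERT's 512-token limit, and the function names are illustrative rather than from any library:

```python
MAX_LEN = 512

def head_only(ids, max_len=MAX_LEN):
    # Option 1: keep only the first max_len tokens.
    return ids[:max_len]

def head_and_tail(ids, max_len=MAX_LEN):
    # Option 2: half the budget from the start, half from the end.
    if len(ids) <= max_len:
        return ids
    half = max_len // 2
    return ids[:half] + ids[-half:]

def mini_batches(ids, max_len=MAX_LEN):
    # Option 3: split into chunks to embed separately, then pool later.
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]

ids = list(range(1000))
print(len(head_only(ids)), len(head_and_tail(ids)), len(mini_batches(ids)))
# -> 512 512 2
```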
Headlines can be insufficient
Headlines, apart from being small in token count, can be downright misleading. "Terror on the street!" could describe actual terrorists, or an escaped leopard or bear; while the former sounds like an event of national importance, the latter generally features in the local news. I therefore did not consider using headlines as my text corpus.
How do you synthesize token-level embeddings into sentence-level embeddings or document-level embeddings?
The Sentence-BERT package has been shown to do well on the sentence-level embedding task. Since the challenge was to brute-force the process, I built a similar approach from the ground up.
Pooling
The [CLS] token embedding that BERT already outputs is mostly used for classification. To convert token-level embeddings to the sentence/document level, I chose to take the average of the embeddings across all tokens of the document (mean-pooling).
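A minimal sketch of mean-pooling with NumPy, masking out padding tokens so they don't distort the average; shapes follow the usual `(seq_len, hidden)` convention and the names are illustrative:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings into one document vector.

    token_embeddings: (seq_len, hidden) array of per-token vectors.
    attention_mask:   (seq_len,) array, 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / np.maximum(count, 1e-9)         # avoid division by zero

emb = np.array([[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]])  # last row is padding
mask = np.array([1, 1, 0])
print(mean_pool(emb, mask))  # -> [2. 3.]
```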
Part 3: Visualization and Analysis
Dimensionality reduction - using UMAP
In my analysis, because of the option of using larger embedding matrices (in the 5-layer case they go up to 3840 dimensions), a good dimensionality reduction becomes even more important: standard techniques like principal component analysis (PCA) do worse as dimensionality grows. PCA is highly influenced by outliers in the data, and as a linear projection it cannot capture non-linear dependencies. I therefore chose a relatively new technique called UMAP (Uniform Manifold Approximation and Projection). Not only does it do a great job at dimensionality reduction, it also synergizes well with the clustering algorithm I chose, HDBSCAN. UMAP is also faster than t-distributed stochastic neighbour embedding (t-SNE) while giving comparable results, and it captures global structure better than t-SNE. There could have been some merit in applying PCA before UMAP to reduce some noise, but the results didn't change much; the argument pca_before_umap=False was used for the chosen output.
Clustering - HDBSCAN over the rest
While there are many clustering algorithms, both new and old, I went with HDBSCAN, which works well with UMAP. My function also has an option for K-means clustering, but I didn't choose it as the final clustering method due to sub-par performance. There were other reasons why I chose HDBSCAN over the alternatives: K-means requires clusters to be spherical, equally sized, and equally dense with most density at the centre; it also "forces" even outliers into the formed clusters; and, most importantly, we have to specify the required number of clusters.
HDBSCAN, on the other hand, has an answer to all of these problems. It handles clusters with different shapes, sizes, and densities, and even allows for outliers. Most importantly, we don't need to specify the number of clusters a priori. It is also faster and better than methods like spectral clustering (need to specify the number of clusters), agglomerative clustering (requires more globular clusters and doesn't allow for outliers), and DBSCAN (worse than HDBSCAN on only one front: it doesn't handle clusters with varying density well).
Choosing the final set of parameters
I went over the clustering results while varying the embedding (the last 1-5 layers concatenated), the number of dimensions reduced to with UMAP (2-31), and the minimum cluster size during clustering (6 and 8). When I clustered in higher dimensions, I applied UMAP a second time to bring the data down to the dimension needed for visualization. Because of my choice of clustering algorithm (HDBSCAN), I was able to cluster in relatively high dimensions (PCA degrades very quickly as dimensions increase). I tried 5 (layer options) * 2 (minimum cluster sizes) * 30 (clustering-dimension options) = 300 combinations. While some clusters were very prominent across parameter settings, I finally went with the combination that formed meaningful categories overall without producing too many categories, which could suggest overfitting. For demonstration purposes, I chose the embedding that used the last hidden layer, reduced to 16 dimensions before clustering, then reduced to 3 dimensions for visualisation after clustering, with a minimum cluster size of 6 (for HDBSCAN).
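The parameter grid above can be sketched with `itertools.product`; `evaluate` would be a hypothetical stand-in for running UMAP + HDBSCAN and inspecting the clusters, so only the grid itself is shown here:

```python
from itertools import product

# The three parameter axes described above.
layer_options = range(1, 6)       # concatenate the last 1..5 hidden layers
min_cluster_sizes = (6, 8)        # HDBSCAN minimum cluster size
cluster_dims = range(2, 32)       # UMAP output dimensions before clustering

grid = list(product(layer_options, min_cluster_sizes, cluster_dims))
print(len(grid))  # -> 300, i.e. 5 * 2 * 30 combinations
```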
Plotting the embeddings with cluster labels
The articles form seven very well-defined clusters which seem appropriate for the period the newspaper scans originate from. I have annotated the clusters with labels that broadly define each theme. To look deeper into the clusters, I used cluster-level TF-IDF to find the most important words per cluster, an approach also used in the BERTopic library: each cluster is collapsed into one single document, and TF-IDF then ranks words by their importance in that cluster. The following figure contains the top 20 words sorted by their TF-IDF scores.
A Deep-dive into the clusters
Understanding the Topics (clusters)
Using the top-words distribution, we can label clusters into very intuitive topics.
- 0: City/State news
- This covers mostly news related to city/state-level issues
- 1: Entertainment, Trivia, Local news
- This spans across local and entertainment news covering local events and trivia
- 2: National News
- This cluster contains news related to national-level politics as well as figures of national importance. Interesting fact: Zsa Zsa Gabor, a very famous actress, appears in this cluster, showing how important celebrity gossip was even then.
- 3: Economy and labor
- Covers issues related to the Economy (national/international) - prices, labor relations and economy-related legislation
- 4: Crime and Loss of Life
- This seems to contain news (national/international) related to criminal activity, accidents, and disasters (like the Vestmannaeyjar volcanic eruption)
- 5: International news, cold war era dominated
- Typical of the time, this contains mostly news related to the cold-war discourse. Covering both military activity and commentaries on communism
- 6: Presidential Politics and presidents
- This cluster has articles focusing on presidents and presidential politics, as expected given that these newspapers are from the Watergate era.
The distance between these topics can be visualized well on a 2-dimensional plane. News related to presidents and to the economy/labor seems very closely related, hinting at the importance of these issues at the time. Since some gossip rises to the national level, national news sits close to local news as well. Of course, this is a 2-D representation of clustering that happened in 16 dimensions, so this figure is not sufficient to comment conclusively on the relationships among topics.
Part 4: Challenges faced and other ideas
Lack of Data
The dataset used in this exercise has very few articles. This affects clustering because we need to keep the minimum cluster size low (for HDBSCAN): even a size of 10 is ~10% of the data. Similarly, we need to keep n_neighbours (UMAP) low, which preserves too much local structure and can cause overfitting. Since HDBSCAN allows for outliers which don't fit in any cluster, I also had to keep min_samples (HDBSCAN) very low to avoid pushing too many of our data points into the outlier category.
Scope for looking at the evolution of topics
The economy cluster seems to focus heavily on labour issues. It would be interesting to look at the evolution of this sub-topic with a larger dataset cut along time, especially before and after the fall of the Soviet Union. Dynamic topic modeling would be interesting more generally as well, to see how the distribution of articles across topics changes over time.
Scope for even more exploration
While I looked at the results of >300 parameter combinations, even with just the options provided in my functions there can be thousands of combinations. As the size of the dataset increases, finer clusters can emerge; making the clustering finer right now would only have meant more noise.